37 research outputs found

    Arabic/Latin and Machine-printed/Handwritten Word Discrimination using HOG-based Shape Descriptor

    In this paper, we present an approach for identifying Arabic and Latin script and its type (machine-printed or handwritten) based on Histogram of Oriented Gradients (HOG) descriptors. HOGs are first applied at word level based on writing orientation analysis. Then, they are extended to word image partitions to capture fine and discriminative details. Pyramid HOGs are also used to study their effects at different observation levels of the image. Finally, co-occurrence matrices of HOG are computed to consider spatial information between pairs of pixels, which is not taken into account in basic HOG. A genetic algorithm is applied to select the potentially informative feature combinations that maximize the classification accuracy. The output is a relatively short descriptor that provides an effective input to a Bayes-based classifier. Experimental results on a set of words, extracted from standard databases, show that our identification system is robust and provides good word script and type identification: 99.07% of words are correctly classified.
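
    As a rough illustration of the partition-level HOG idea in this abstract, the following is a minimal NumPy sketch, not the authors' implementation. The function name `hog_descriptor`, the 2x4 grid, and the 9 orientation bins are assumptions chosen for the example; the pyramid, co-occurrence, and genetic-algorithm stages are not covered.

```python
import numpy as np

def hog_descriptor(image, grid=(2, 4), n_bins=9):
    """Concatenate per-cell orientation histograms of a grayscale image.

    Hypothetical helper: grid size and bin count are illustrative choices.
    """
    image = np.asarray(image, dtype=float)
    gy, gx = np.gradient(image)            # derivatives along rows, then columns
    magnitude = np.hypot(gx, gy)
    # Unsigned gradient orientation, folded into [0, 180] degrees
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0

    rows, cols = grid
    row_edges = np.linspace(0, image.shape[0], rows + 1).astype(int)
    col_edges = np.linspace(0, image.shape[1], cols + 1).astype(int)

    features = []
    for r in range(rows):
        for c in range(cols):
            cell = (slice(row_edges[r], row_edges[r + 1]),
                    slice(col_edges[c], col_edges[c + 1]))
            # Each pixel votes for its orientation bin, weighted by magnitude
            hist, _ = np.histogram(angle[cell], bins=n_bins,
                                   range=(0.0, 180.0),
                                   weights=magnitude[cell])
            features.append(hist)

    descriptor = np.concatenate(features)
    norm = np.linalg.norm(descriptor)
    return descriptor / norm if norm > 0 else descriptor

# Toy "word image": a single vertical edge in a 16x32 image
toy = np.zeros((16, 32))
toy[:, 16:] = 1.0
toy_descriptor = hog_descriptor(toy)
```

    The descriptor length is `grid[0] * grid[1] * n_bins`, so a finer partition captures more local detail at the cost of a longer feature vector, which is the trade-off that feature selection is meant to address.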

    How to separate between Machine-Printed/Handwritten and Arabic/Latin Words?

    This paper gathers some contributions to the identification of script and its nature. Different sets of features have been employed successfully for discriminating between handwritten and machine-printed Arabic and Latin scripts. They include some well-established features, previously used in the literature, and new structural features which are intrinsic to Arabic and Latin scripts. The performance of these features is studied throughout this paper. We also compared the performance of five classifiers used to identify the script at word level: Bayes (AODEsr), k-Nearest Neighbor (k-NN), Decision Tree (J48), Support Vector Machine (SVM) and Multilayer Perceptron (MLP). These classifiers were chosen to be sufficiently different from one another to test the feature contributions. Experiments have been conducted with handwritten and machine-printed words, covering a wide range of fonts. Experimental results show the capability of the proposed features to capture differences between scripts and the effectiveness of the three classifiers. Average identification precision and recall rates of 98.72% were achieved, using a set of 58 features and the AODEsr classifier, which is slightly better than those reported in similar works.
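
    For readers unfamiliar with the precision and recall rates reported in this abstract, here is a minimal sketch of how they are computed for one class. The function name `precision_recall` and the toy labels are assumptions made for the example and are not taken from the paper.

```python
def precision_recall(y_true, y_pred, positive):
    """Return (precision, recall) for the class named `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy word-level script labels (illustrative only)
truth = ["arabic", "arabic", "latin", "latin"]
predicted = ["arabic", "latin", "latin", "latin"]
arabic_scores = precision_recall(truth, predicted, "arabic")
latin_scores = precision_recall(truth, predicted, "latin")
```

    Averaging such per-class scores over all classes gives a single figure comparable in kind to the average rate quoted in the abstract.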

    Reconnaissance de formules mathématiques Arabes par un système dirigé par la syntaxe

    The aim of this contribution is to present a syntax-directed system that recognizes Arabic mathematical formulas and returns the recognition results in MathML format. A set of replacement rules is defined by a coordinate grammar to analyze Arabic mathematical formulas. This grammar is applied using symbol recognition and analysis of the symbols' spatial arrangement. We used k-nearest neighbors to recognize Arabic mathematical symbols, together with a parser that is both top-down and bottom-up and relies on operator dominance to recursively divide the formula into simpler subformulas. In the proposed system, the symbol recognition and structural analysis modules interact closely. It is thus possible to use structural information to help resolve ambiguous or confused symbols. This syntax-directed recognition system has been successfully demonstrated on several types of formulas found in various Arabic scientific documents.

    A System for an automatic reading of student information sheets

    ISBN: 978-1-4577-1350-7. In this paper we present a student information sheet reading system. An algorithm is proposed to locate and label handwritten answer fields. As information sheets can be filled in Arabic and/or in French, automatic script language differentiation is a required pre-recognition step in the proposed system. We have developed a robust and fast field classification and script language identification method, based on a decision tree, to make this processing practical for sheet recognition. To this end, the system uses several novel features (loops, descenders, diacritics) and analyses the lower profile of the script. The classification rates are 92.5% for numeric fields, 94.34% for Arabic scripts and 94.66% for French scripts. Experimental results, obtained on 80 sheets, show that our system provides an effective way to convert printed sheets into a computerized format or to collect information for a database from printed sheets.

    Structural Features Extraction for Handwritten Arabic Personal Names Recognition

    Because handwriting is highly variable and imprecise, obtaining features that represent words is a difficult task. In this research, a feature extraction method for handwritten Arabic word recognition is investigated. Its major goal is to maximize the recognition rate with the smallest number of elements. This method incorporates many characteristics of handwritten characters based on structural information (loops, stems, legs, diacritics). Experiments are performed on Arabic personal names extracted from registers of the national Tunisian archive and on some Tunisian city names from the IFN-ENIT database. The results obtained are encouraging and open further perspectives on feature and classifier selection for Arabic handwritten word recognition.

    Segmentation of Touching Component in Arabic Manuscripts

    Touching components are connection zones occurring between text-lines or between words of the same line, and they are one of the problems that make unconstrained handwritten text segmentation very hard. In this paper, we propose a recognition-based method to separate these components once they have been localized in Arabic manuscript images. It first identifies, for a given touching component, a similar model stored in a dictionary with its correct segmentation, using a shape context descriptor and an interpolation function. Then, it segments the touching component based on the distance from the midpoints of the identified model's parts. Tests are performed using a database of touching components and two metrics: Manhattan and Euclidean distances. Experimental results show the effectiveness of the proposed segmentation method.
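
    The two metrics named in this abstract, and the general idea of picking the closest stored model, can be sketched as follows. This is an illustration under assumptions: the feature vectors are plain tuples rather than shape context descriptors, and `nearest_model` and the dictionary entries are hypothetical names.

```python
import math

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_model(query, models, distance):
    """Return the name of the stored model closest to `query`."""
    return min(models, key=lambda name: distance(query, models[name]))

# Hypothetical dictionary of model descriptors (illustrative values)
models = {"model_a": (0.0, 0.0), "model_b": (3.0, 4.0)}
best_l1 = nearest_model((2.5, 3.5), models, manhattan)
best_l2 = nearest_model((2.5, 3.5), models, euclidean)
```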

    A Novel Approach for the Recognition of a wide Arabic Handwritten Word Lexicon

    This paper introduces a novel approach for the recognition of a wide vocabulary of Arabic handwritten words. There is an essential difference between the global and analytic approaches in pattern recognition. While the global approach is limited to a reduced vocabulary, the analytic approach succeeds in recognizing a wide vocabulary but runs into the problems of word segmentation, especially for Arabic. By combining the neural approach with some linguistic characteristics of Arabic, we expect to improve recognition and to handle a large vocabulary of Arabic handwritten words. The proposed approach invokes two transparent neural networks, TNN_1 and TNN_2, to recognize, respectively, roots, schemes and the elements of conjugation from the structural primitives of the words. The approach was evaluated using real examples from a database established for this purpose. The results are promising, and suggestions for improvements are proposed.

    A Neural-Linguistic Approach for the Recognition of a Wide Arabic Word Lexicon

    Recently, we have investigated the use of Arabic linguistic knowledge to improve the recognition of a wide Arabic word lexicon. A neural-linguistic approach was proposed to deal mainly with the canonical vocabulary of decomposable words derived from tri-consonant healthy roots. The basic idea is to factorize words by their roots and schemes. To this end, we designed two neural networks, TNN_R and TNN_S, to recognize, respectively, roots and schemes from the structural primitives of words. The proposed approach achieved promising results. In this paper, we focus on how to reach better results in terms of accuracy and recognition rate. The current improvements mainly concern the training stage. They consist of 1) exploiting the order of the letters in a word, 2) considering "sister letters" (letters having the same features), 3) supervising the networks' behaviour, 4) splitting neurons to preserve letter occurrences and 5) resolving observed ambiguities. With these improvements, experiments carried out on a 1500-word vocabulary show a significant enhancement: the top-4 rate of TNN_R (resp. TNN_S) rose from 77% to 85.8% (resp. from 65% to 97.9%). Enlarging the vocabulary from 1000 to 1700 words in steps of 100 words again confirmed the results without altering the networks' stability.
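
    The top-4 figures quoted in this abstract count a word as recognized when the correct answer appears among the four best candidates. A minimal sketch of that measure follows; the function name `top_k_accuracy` and the candidate strings are placeholders invented for the example.

```python
def top_k_accuracy(ranked_candidates, truths, k=4):
    """Fraction of samples whose true label is among the first k candidates.

    `ranked_candidates[i]` is a list ordered from best to worst guess.
    """
    if not truths:
        return 0.0
    hits = sum(truth in candidates[:k]
               for candidates, truth in zip(ranked_candidates, truths))
    return hits / len(truths)

# Placeholder root candidates for three word images (illustrative only)
ranked = [
    ["r1", "r2", "r3", "r4", "r5"],
    ["r7", "r2", "r9", "r1", "r3"],
    ["r4", "r5", "r6", "r7", "r8"],
]
truths = ["r3", "r3", "r4"]
top4 = top_k_accuracy(ranked, truths, k=4)
top1 = top_k_accuracy(ranked, truths, k=1)
```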

    Arabic Handwritten Documents Segmentation into Text-lines and Words using Deep Learning

    One of the most important steps in a handwriting recognition system is text-line and word segmentation. However, this step is made difficult by differences in handwriting styles, by skew, by overlapping and touching text, and by fluctuating text-lines. It is even more difficult for ancient and calligraphic writings, as in Arabic manuscripts, due to the cursive connection in Arabic text, the erroneous position of diacritic marks, the presence of ascending and descending letters, etc. In this work, we propose an effective segmentation of Arabic handwritten text into text-lines and words, using deep learning. For text-line segmentation, we used an RU-net, which performs a pixel-wise classification to separate text-line pixels from background pixels. For word segmentation, we resorted to the text-line transcription, as no ground truth is available at word level. A BLSTM-CTC (Bidirectional Long Short-Term Memory followed by a Connectionist Temporal Classification) is then used to perform the mapping between the transcription and the text-line image, avoiding the need for input segmentation. A CNN (Convolutional Neural Network) precedes the BLSTM-CTC to extract the features and to feed the BLSTM with the essential content of the text-line image. Tested on the standard KHATT Arabic database, the experimental results confirm a segmentation success rate of no less than 96.7% for text-lines and 80.1% for words.
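
    To illustrate why CTC avoids the need for a pre-segmented input, the sketch below shows the standard CTC collapsing rule applied to a sequence of per-frame labels: repeated labels are merged, then blanks are dropped. It assumes best-path (greedy) frame labels are already available and that the blank index is 0; the function name `ctc_collapse` is hypothetical, and the network itself is not modelled.

```python
from itertools import groupby

def ctc_collapse(frame_labels, blank=0):
    """Merge consecutive repeated labels, then remove blank labels."""
    return [label for label, _ in groupby(frame_labels) if label != blank]

# Eight frames decode to three symbols; the blank between the two 1s
# keeps them as separate symbols instead of merging them.
frames = [0, 1, 1, 0, 1, 2, 2, 0]
decoded = ctc_collapse(frames)
```

    Because many frame sequences collapse to the same transcription, training only needs the text of a line, not the position of each character or word.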
